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Abstract 

We propose iterative proportional scaling (IPS) via decomposable submodels for 
maximizing likelihood function of a hierarchical model for contingency tables. In 
ordinary IPS the proportional scaling is performed by cycling through the members 
of the generating class of a hierarchical model. We propose to adjust more marginals 
at each step. This is accomplished by expressing the generating class as a union of 
decomposable submodels and cycling through the decomposable models. We prove 
convergence of our proposed procedure, if the amount of scaling is adjusted prop- 
erly at each step. We also analyze the proposed algorithms around the maximum 
likelihood estimate (MLE) in detail. Faster convergence of our proposed procedure 
is illustrated by numerical examples. 

Keywords and phrases: decomposable model, hierarchical model, /-projection, iterative 
proportional fitting, KuUback-Leibler divergence. 

1 Introduction 

Iterative proportional scaling algorithm for contingency tables, first proposed by Deming 
and Stephan [8], has been well studied and generalized by many authors. Ireland and 
Kullback [TT] proved convergence of IPS and Fienberg flO\ gave a simpler proof of con- 
vergence from geometric consideration. Darroch and Ratcliff [6J made a generalization to 
IPS and its geometrical property was studied by Csiszar [5j. Csiszar [4j also gave a more 
general proof of convergence and justified IPS in a general framework. Extension of IPS to 
continuous case was studied in Kullback [H] and Riischendorf [17]. Effective algorithms 
and implementations of IPS have been also studied by many authors, including [1], [9], 

m- 

In this paper, we propose another generalization of IPS based on decomposable sub- 
models. Decomposable models or graph decompositions have been already considered 
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by Jirousek p^, Jirousek and Pfeucil |T3] and Malvestuto |T6]. However they used de- 
composable models for efficient implementation of conventional IPS in the form of tree- 
computation. Here we use decomposable submodels for generalizing IPS itself. In our al- 
gorithm we adjust a larger set of marginals than the conventional IPS. The set of marginals 
form the generating class of a decomposable submodel. By adjusting more marginals, our 
proposed algorithm achieves a faster convergence to the maximum likelihood estimate 
than the conventional IPS, although at present it seems difficult to theoretically prove 
that our procedure is always faster. We prove convergence of our proposed procedure, if 
we adjust the amount of scaling at each step. We also analyze in detail the behavior of 
the proposed algorithms around the maximum likelihood estimate. As shown in Section 
Hlour procedure works well in practice without adjusting the amount of scaling at each 
step. 

As suggested by a referee, it is an important topic to combine the idea of the present 
paper and the tree-computation approach for efficient implementation of IPS. Although 
we do not give a general result, in Section S] we investigate the combination in the case 
of cycle models and show effectiveness of the combination by numerical experiments. 

The organization of this paper is as follows. In Section [2] we summarize notations and 
basic facts on hierarchical models and decomposable models for multiway contingency 
tables. In Section [3] we propose a generalized IPS via decomposable submodels, prove its 
convergence and clarify its behavior close to the maximum likelihood estimate. In Section 
m we perform some numerical experiments to illustrate the effectiveness of the proposed 
procedure. Some discussions are given in Section [51 

2 Preliminaries 

In this section we summarize notations and preliminary materials on decomposable models 
and conventional IPS. 

We follow the notation of Lauritzen [15]. Let A denote the set of variables of a 
multiway contingency table. For each 6 E A, Is = {1,2, ... , Is} denotes the set of levels 
of 6. The set of cells is denoted by X = XseA^s- Let n{i) denote the frequency of a 
cell i E I and let n = denote the total sample size. Throughout the paper we 

denote the relative frequency (empirical distribution) by r{i) = n{i)/n. For a cell i and a 
subset of variables a C A, the marginal cell of i for a is denoted by G Xq = x^g^X^, the 
marginal of r on a is denoted by r [a] , and the marginal relative frequency of a is denoted 
by r{ia). 

The generating class of a hierarchical model is the family of the variable sets indexing 
the maximal interaction terms in the hierarchical model. We denote a hierarchical model 
with generating class C by M{C), and call the sets in C the generators of M{C). A 
hierarchical model is a decomposable model if there exists an ordering (Ci, . . . , Cm) of its 
generators that satisfies the running intersection property: 

(RIP) For each j {2 < j < m), there exists k {1 < k < j — 1), such that 

n (Ci U C2 U ■ ■ ■ U Cj_i) C Ck. 
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Such an ordering is called a perfect sequence. Given a perfect sequence (Ci, . . . , Cm) of 
the generators of a decomposable model, let 

Sj = Cj. n (Ci U ^2 U ■ ■ ■ U Cj^i). (2 < J < m) 

If no Sj is the empty set, then the decomposable model is said to be connected. If this is 
the case, then each set Sj is called a separator of the generating class of the decomposable 
model; moreover, both the generators of the decomposable model and the separators of 
its generating class can be graphically viewed as being the (maximal) cliques and the 
minimal vertex separators of a suitable chordal, connected graph, sometimes called the 
"adjacency graph" of the generating class of the decomposable model. In what follows, 
we always assume that a decomposable model is connected. In this paper 

S = {5*2, . . . , Sm}, 

denotes the multiset of separators. The number of times a separator S appears in S is 
called the multiplicity of 5*. 

The MLE of the cell probabilities {p{i)} under a hierarchical model M{C) is given 
by the probability distribution denoted by pc that belongs to M(C) and satisfies the 
marginality constraints 

Pc[C] = r[C], VC e C. (1) 

Equivalently, pc is the extension of the set of probability distributions {r[C] : C E C} that 
has the maximum entropy. If M(C) is a decomposable model then pc{i) has the following 
product-form expression: 



if r{ic) > 0, VC e C, 



Pc{^ = {]\s^sr{^sy ' ' (2) 

0, otherwise. 

In the following we call pc in ([2]) the maximum- entropy extension of the set of probability 
distributions {r[C] : C G C}. In Algorithm 2 below, we use the maximum-entropy 
extension of the form ([2]) of the set {q[C] : C G C} even when q is not necessarily 
normalized to be a probability distribution. 

For obtaining MLE for other graphical or hierarchical models we need some iterative 
procedure. The following conventional IPS, cycling through the elements of the generating 
class, is commonly used for this purpose. In the following let p^^\i) denote the estimate 
of the probability of the cell i at the t-th step of iteration and let p^*-* = {p'^^\i)}. 

Algorithm (Conventional IPS) 

Let p'^^\i) = 1/|X|. The updating formula is given as 

where C = Cj, j = {t mod m) + 1. 
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The KuUback-Leibler divergence (KL-divergence) from a probability distribution p to 
another probabihty distribution q is denoted by 

p{i) 



The log sum inequality (Chapter 2 of [3J) for non-negative numbers ai,...,aN and 
bi,...,bN is 

N N N 



^Oilog-^ > alog-, aj > 0, 6i > 0, a = ^ai,b = ^bi, 

1=1 * 1=1 i=l 

where a log ^ = oo if a > 0, and log = 0. The equality holds if and only if ai/bi = const. 



3 Iterative proportional scaling via decomposable sub- 
models 

In this section we propose a generalization of conventional IPS and study its properties. At 
each step of our procedure we update a larger set of marginals, which form a decomposable 
submodel. We prove convergence of our proposed procedure, if the amount of scaling is 
adjusted properly at each step. We also give a detailed analysis of our procedure when 
the current estimate is close to MLE. 



3.1 Proposed algorithms 

We now describe our proposed procedure. A model M(C') is a submodel of M{C) if each 
generator of M(C') is contained in some generator of M(C). Let {M(Ci), . . . , M{Cu)} be 
a set of decomposable submodels of M(C) such that each generator of M(C) is contained 
in the generating class of M{Cj) for some j. In this case we say that {M(Ci), . . . , M(C„)} 
spans M{C). 

In our procedure there is a problem of normalization as discussed below. Therefore 
we denote the non-normalized estimated cell probability at the t-th. step by g*^*-' and the 
normalized estimated cell probability by p^*-* . 

Algorithm 1 Let g^^-* = 1/|X|. We cycle through Ci,C2, . . . ,C„ and for the t-th step 
we update the non-normalized estimated cell probabilities as follows 

q{t+i) = q(t)Jj^^ j = (t modM) + l, (4) 

where rj is the maximum-entropy extension of the set of probability distributions {r[C] : 
C G Cj} and gj*^ is the maximum-entropy extension of the set of probability distributions 
{g^*^[C] : C G Cj}, and the normalized cell probabilities as 
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Example 3.1. Consider a 4-way contingency table H x J x K x L and the following 
hierarchical model with generating class C = {{H, J}, {J, K}, {K, L}, {H, L}} ("4-cycle 
model") : 

Phjki = exp{ahj + bjk + Cki + dhi). 

By slight abuse of notation write A = {if, J, K, L}. The following Ci and C2 is an example 
of the family of submodels that spans C. 

Ci = {{//, J},{J,K},{K,L}}, 
C2 = {{//, J},{K,L},{H,L}}. 

For each submodel, the updating procedure is performed as follows. 

^ ^' ^ ^' r(gxr(^,) gW(^.,)xgW(^,,)xgWM' 

^ ^' ^ ^' rith)y<riti) g(*+i)(^/.,)xg(*+i)(^H)xg(*+i)(z,0' 

If we set Ci = {Ci}, . . . ,Cm = {Cm}, Algorithm 1 coincides with the conventional 
IPS. Ci,...,Cm span C. Each Cj is composed of one generator of the model. Hence 
M{Cj) is a decomposable submodel of M{C). Therefore Algorithm 1 is a generalization 
of conventional IPS. In the conventional IPS, p'-*"'"^-* (i) in ([3]) satisfies p^^~^^^{ic) = r{ic), 
which is a likelihood equation in ([T]). But in general p^^~^^\i) in (JSj) does not satisfy ([T]). In 
other words, from a geometric viewpoint of /-projection in Csiszar ([1],[5]), the updating 
rule (HI) is not a projection. We discuss it again in the next section. 

In dl]), we update q^^\i). It should be noted that we have 

because the normalizing constant is canceled on the right-hand side of (jl]). Also it is 
easy to see that, if Algorithm 1 in terms of {q^^\i)} converges, then the limiting g's are 
automatically normalized. 

Unfortunately it is difficult to prove convergence of Algorithm 1. The difficulty lies in 
the fact that the sum ^jgj Q'^*^^-' (^) after updating might exceed 1 (i.e. Q^^^^K'^) > 1) 
in Algorithm 1 even if g*^*-* is normalized as XliQ'*'*^^ — 1- However we recommend it 
because in practice, it works well and has converged to MLE in all of our experiments 
and converges faster than the conventional IPS as shown in Section HI 

In order to deal with the theoretical difficulty concerning the normalization of we 
consider adjusting the amount of updating. At this point, we need the following lemma. 

Lemma 3.1. Let r and q be two probability distributions over I, and M{C') a decom- 
posable model over a nonempty (proper or improper) subset A' of A. Let re be the 
maximum- entropy extension of the set of probability distributions {r[C] : C G C'} and qc 
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be the maximum- entropy extension of the set of probability distributions {q[C] : C G C'}. 
If q[A'] is not an extension of the set of probability distributions {r[C] : C G C'}, then 
there exists a unique aQ > for which the function 



q =qx \ — 

qc 



is a probability distribution. 
Proof. In view of ([2]) we have 



1 = ^^rcii) = ^gc'(^) 



Therefore if 



..<1 (6) 

qcw 

for all i, then the equality in ([6]) holds for all i with q{i) > 0. Therefore under the 
condition of the lemma there exists at least one cell z G X such that 



qc'{i) 



> 1, g(0 > 0. 



Then g(z)(rc'(i)/gc'(^))° this i is strictly convex in a and diverges to +00 as a —>■ 00. 
Write 

Then 5'(a)is also strictly convex in a and diverges to +00 as a ^ 00. 

Write C = {Ci, . . . , C^} and S' = {S2, ■ ■ ■ , S^}. Consider the differential of g{a) at 
a = 0. 



cec ic ' ses' is ^ 



Now 



is the negative of KL-divergence and nonpositive. By the log sum inequality, 
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is also nonpositive ior 2 < k < v. Equality holds if and only if 

ritc)=q^'\tc), yCeC',Wtc. 

Then, except for such a case, g{0) = 1, g'{0) < 0, g{oo) = oo, and g{a) is strictly convex 
in a. Therefore there exists a unique ao > such that g{aQ) = 1. □ 

We now present the following algorithm and its modification based on Lemma 13.11 

Algorithm 2 Let a^^^ > 0. We cycle through Ci,C2, ■ ■ ■ ,Cu and for the t-th step we 
update the unnormalized estimated cell probabilities as 



j = {t mod u) + 1, 



and the normalized cell probabilities as p^^~^^\i) = / ^^^zj (l^^^^\k) . 

Note that also in Algorithm 2 we do not need to normalize at each step and we can 
perform normalization any time, because {q^^\i)} is always proportional to {p'^^\i)}- 

Algorithm 3 We cycle through Ci,C2, ■ ■ ■ ,Cu and for the t-th step we update the 
estimated cell probabilities as follows 





j = {t mod u) + 1, (9) 

where a^^ > is given in Lemma 13.11 with q = g^*) . 

3.2 Correctness of the proposed algorithms 

In this section, we prove the correctness of proposed algorithms. As before let {r{i)} 
denote the empirical distribution and let {pci'i)} denote the MLE. Because we consider 
hierarchical models, the following equation holds ([1], [5]). 

/(r : q) = I{r : pc) + I{pc ■ q)- 

I{r : q) corresponds to the log likelihood. Therefore we can prove the correctness of our 
algorithms by proving I{pc : g*-*^) ^ as t — oo. 

Theorem 3.1. Algorithm 3 converges to MLE. 
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Proof. Consider KL-divergence after updating, 



^ Pc{i) log- 



Pc[i) 



Write Cj = {Ci, . . . , C^} and Sj = {S2, ■ ■ ■ , S^} as in the proof of Lemma [3?T1 Then, 
is a KL-divergence, and nonnegative. By the log sum inequality, 
is also nonnegative ioi 2 <k <v. Therefore, 



holds. Equality holds if and only if r{ic) = Q^^H^c), VC G Cj. We see that I{pc]p^^^) 
always decreases after updating. The rest of the proof is the same as the classical one 

fin). □ 



Corollary 3.1. Using < a*^*-* < Oq*"*, Algorithm 2 converges to MLE. 
Proof. Consider KL-divergence after updating, 

where (/(a) is given in ([7]) with q = q^^\ Because a^*) < Oq'', log(7(a(*^) is nonpositive and 
I{pc;p^^^) always decreases after updating. The rest of the proof is the same as Theorem 

o □ 

At this point we discuss Algorithm 3 from a geometric viewpoint of /-projection in 
the sense of Csiszar ([1], [5j). In our procedure we adjust a larger set of marginals 
than the conventional IPS and in practice KL-divergence decreases more in our proposed 
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algorithms than the conventional IPS for each step. However it is difficult to guarantee this 
theoretically. The difficulty lies in the fact that the updating rule ([9]) is not a projection. 
In fact, if we repeat (Q twice with the same Cj then the cell probabilities change, whereas 
in the conventional IPS repeating the same updating step twice does not change the cell 
probabilities after the ffist update. We can understand the situation as follows. Starting 
from the current estimate {p^^\i)} suppose that we repeat the step with the same 
Cj until the cell probabilities converge to {p*{'i)}- Then the limit {p*{i)} maximizes the 
likelihood function among {p{i)} of the form 

p(z)=pW« HM^c). (10) 

The right-hand side of ffTOl) forms a log-affine model through {p^^\i)} (Section 4.2.3 of 
[T5]). Since updating a single C G Cj in the conventional IPS is a special case of ffTOj) . it 
follows that 

I{pc:pn<I{Pc:p^'^'^'), (11) 

where {p^*'^^^' (i)} is the updated estimate by the conventional IPS for some C E Cj. 
Therefore a larger decrease of KL-divergence of our procedure compared to conventional 
IPS is only guaranteed in the sense of ffTTl) . The situation will become more clear when 
we analyze the behavior of Algorithm 3 close to MLE in the next section. 



3.3 Analysis of behavior close to the maximum likelihood esti- 
mate 

In this section, we study the behavior of our algorithms when the current estimate is 
already close to MLE. We assume that MLE is in the interior of the parameter space and 
Pc{i) > for all i G X. We analyze the behavior of Oq*''. We also consider the value of 
which reduces the KL-divergence most and the value of a*-*^ = such that 
KL-divergence decreases in Algorithm 2 for < a^*-* < 



a 



it) 

2 



We repeatedly use the following expansion. 



log(l + x) =X-y + 0(x^), X^O. (12) 

Assume that the current estimate {p^*\i)} is close to MLE in the following sense. For 
sufficiently small e > and for all C G C, 5 G S, ic, is we have 

pW{tc) p^^>{ts) 

The following proposition describes the behavior of ctg*^ in Algorithm 3. 



9 



Proposition 3.1. Assume {p^*\i)} is close to MLE in the sense of /flgj). Then 



p(t)(ic) M ^565, ( p(')(is) 1 

= 7 r— ^ r— rrf + 0{e). (14) 



Before giving a proof of this Proposition we rewrite the numerator of the right-hand 
side of (fT^ . Let Cj = {Ci, . . . , C^} and iSj = {S2, ■ ■ ■ , Sy}. Then 

Therefore the numerator is nonnegative. Also note that the denominator of the right-hand 
side of ([HD can be written as 



f r{is) _ 



(16) 



We see that the numerator of a^^ consists of the diagonal square terms when we expand 
the square of denominator in the form of f|T6l) . We now give a proof of Proposition 13. 1[ 

Proof. Consider the following expansion, 

Yic^c.'^i^c) Ylses.P^'K^s) ^ r{ic) r{is) 



=0{e). 

Then the s-th derivative of g{a^*^) at is 



9^ K^) =2^P^'{^) log rf — ^ rf (t)f \ 
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The first and the second order derivatives of g{a''^^) at are, 



ceCj ic ^ y oj g^^, y K 

^EE{(..)-.«M)-q^(^-0^ 

-EE{««»)-.'''(«.))-q^(^i)-)>o(., 



and 



Then, we expand g{a^*^) at 0, 

^(aW) = 5(0) + a(*^5(^)(0) + ^-^9^'\0) + 0{e'). 

Assuming normalization at each step of the algorithm, we have g{0) — 1 and substituting 
CKo^ for a^^\ we obtain 



a 



it) 



+ 0{e) 



'WW 

□ 
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+ 0(£). 



Consider f|T5l) and f|T6|) . If the signs of the terms on the right hand side of (|T6l) are "ran- 
dom" then we can expect that a^'' is close to 1. We can imagine that {p^^\i)} converges 
to MLE from various directions. Then is close to 1 "on the average" . Furthermore as 
shown in the following proposition a^^ is the optimum value of the adjustment close to 
MLE. We believe that this is the reason that Algorithm 1 works very well in practice. 

Proposition 3.2. Assume {p^*\i)} is dose to MLE in the sense of / flgj) . Then 

af) =aW + 0(£), (17) 

where af^ is the value of a^^^ which reduces the KL- divergence most. 
Proof. Define by 

F{a^^>) = a^*> > p*{t) log ^— - x ' ..... . , 18 

which corresponds to the decrease of KL-divergence before normalization. Consider the 
derivative of F(q;(*)), 

F'-'{a'-') = } p {t)\ogY^ ,. , X — 



— 1 log: : — r X 




= _^(i)(0) + O(e3). 
Consider the derivative of F{a^^'^) — \ogg{a^^^) and equating 0, we obtain, 



Then 



^«(«f))=^«(0) + aiV)(0) + O(.^), 
g{a?) = g{0) + a?g('\0) + ^-^g^'\0) + 0(r 

= l + 0(£2) 
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and 

Therefore we have 

"S" = ^|||p + 0(.)=ar+0(.). 



□ 



Finally we show that KL-divergence decreases in the range < < 2aJ*\ This 
result indicates that in Algorithm 2, a*^*-* > a^'' often decreases KL-divergence in practice. 

Proposition 3.3. Assume {p^*\i)} is dose to MLE in the sense of ( flgj) . Then 

= 2^*^ + 0(e). (19) 

where ^ is the value of a^^^ such that I{pc : = I(pc : p'-*^) in Algorithm 2. 

Proof. 



and 



= f (a?') - logj(a«) = -a<''9'"(0) - q4"j'"(0) + ^s'^'CO) + 0(£= 



□ 



We show the behavior of log (7(0^*-*) and F(q;^*'') — logg{a^^^) in Figure [H Proposition 
13.11 Proposition 13.21 and Proposition 13.31 indicate that in many cases we can decrease 
KL-divergence by using a*^*-* = 1. In the next section we illustrate this by numerical 
experiments. 



4 Numerical experiments for cycle models 

In this section, we compare our Algorithm 1 with the conventional IPS by numerical exper- 
iments. We consider J-way cycle model with the generating class {{1, 2}, {2, 3}, . . . , { J — 
1, J}, { J, 1}} for J > 4. As a family of decomposable submodels which span the model 
we use the set of two decomposable submodels obtained by deleting one element of gen- 
erating class of the hierarchical model. We show the considered model and its submodels 
in Table [11 where {1,2} is abbreviated as 12. For example in the 5-way case we span 
M5 = {12, 23, 34, 45, 15} by M5 \ {15} and M5 \ {23} as illustrated in Figure M 

Before we present the results of the experiments, we consider the space-saving imple- 
mentation of Algorithm 1. 
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Figure 1: Behavior of log (^(a*^*'*) and ^(0^*-*) — log (^(q;^*'') 



Table 1: The submodels in numerical experiments 



Dim 


Hierarchical model 


Decomposable submodels 


4 


M4 = {12,23,34, 14} 


M4\{14},M4\{23} 


5 


M5 = {12,23,34,45,15} 


M5\{15},M5\{23} 


6 


Me = {12,23,34,45,56,16} 


Me \ {16}, Me \ {34} 


7 


M7 = {12, 23, 34, 45, 56, 67, 17} 


M7\{17},M7\{34} 


8 


Mg = {12, 23, 34, 45, 56, 67, 78, 18} 


Ms \ {18}, Ms \ {45} 




Figure 2: A decomposable submodels in a 5-way case 



4.1 Tree-computation of Algorithm 1 

For the conventional IPS, the implementation of the tree-computation has been considered 
in Jirousek [12] and Jirousek and Pfeucil [13]. Badsberg and Malvestuto[T] improved the 
algorithm by applying the Markovian information propagation techniques with junction 
trees of the triangulated models. 

In this section we apply the Markovian propagation approach to our Algorithm 1 for cy- 
cle models. We triangulate the J- way cycle model by adding the edges {1, 3}, {1, 4}, . . . , {1, J— 
1}. Let D* and S* denote the triangulated model and the set of separators of P*, 

= {{1,2, 3}, {1,3, 4},... ,{1, J -1, J}}, 5* = {{1,3}, {1,4},..., {1, J- 1}}. 
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Then we note that the cell probabilities p{i) satisfy 



Uses* Pi's) ■ 

So we consider the tree-computation algorithm which updates q^'\zc),C eV* instead of 
q^^\i). Let = maxcec* l^cl- While the computational cost per an update procedure 
of Algorithm 1 is 0(|X|), that of the tree-computation algorithm is reduced to 0{\X^\). 
Denote by C* a generator {1, j — 1, j}, j = 1, . . . , J. The junction tree for V* is uniquely 
defined as in Figure S]TJ The decomposable submodels we use are Mj = Mj \{J — 1, J} 
and Mj = Mj \ { J' — 1, J'} for some 1 < J' < J. Direct the junction tree in two ways 
such that Cj and Cj' are the unique sink as in Figure [3] and denote them by Ti and T2, 
respectively. Then the Markovian propagation algorithm proposed here is described as 
information propagation on Ti and T2. 



ct) — fc"A (c}^ — ... fc}_\ — fc} 



Figure 3: The junction tree for the J-waj cycle model 
Ti — /ci~\ — . /cp^ ... — jcr}\ /S^: 



n (ct) — /cjV^ >IG}A^ ... ^c}\^c} 



Figure 4: The directed trees 



Algorithm 4 

Define r^^\i{i,2}) = ^(^1,2}) and rW(i{j_i_j}) = r{i{j_ijy) for all t 
(1) Update via Mj 
for j = 3 to J do 

if j 7^ J, update q^^\ic*) and (^{ij}) by 



(20) 
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and send to C*^^. 

if J = J, update q^^\ic}) by 

(2) Update via Mj 

for j = 3 to J' — 1 

update q^^\ic*) and T(*)(i{ij}) by ([20]) and send ^j+i- 

for j = J to J' + 1 

update q^^\ic*) and (^{i,j-i}) by 

and send to C*_p 

Update q^^\ic*,) by 



□ 



It is easy to show that 



Uses^<l^H^s) UsesA's) Ucec.Q^H^c) 
.^^HO X X (21) 

2T]) looks the same as (jlj). However there are some minor differences. In (jlj) q^^\is) 



J2i q^^\i)- However q^^\is) in (!2T!) is derived by q^^\is) = q^^K^c) for some 

C G C*. Since q^^\is) is not necessarily normalized, q^^^ic) J2i Q^^K^c) for 

C ^ C and C, C" G C*. Hence in general J2i q'^^\ic) 7^ Zli q''^\i). In this sense 
Algorithm 4 is an approximate algorithm for Algorithm 1. 

In the experiments, we compare the performance of Algorithm 4 and the Markovian 
propagation algorithm for the conventional IPS by Badsberg and Malvestuto[I]. 
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4.2 The results of the numerical experiments 

In this section we present the results of numerical experiments. We set Ji = ■ ■ ■ = = / 
and / = 2, 3 or 4. We generated random contingency tables by filling each cell by uniform 
random integers from 1 to 10^ and we obtained MLE by Algorithm 4 and the Markovian 
propagation algorithm for the conventional IPS by Badsberg and Malvestuto[T]. As the 
convergence criterion we used 

j=3 iQ* eicf 

For each dimension and each number of levels, we generated 1000 contingency tables and 
took the average of the CPU time and the number of steps to convergence. Denote by 
r and Tconv the CPU time for Algorithm 4 and the conventional IPS, respectively. Let 
V and z/conv be the number of steps to convergence for Algorithm 4 and the conventional 
IPS, respectively. We also calculated the probability that r < Tconv and u/ucomr- The 
computation was done on a Pentium IV 3.2GIIz CPU machine. 

The results are shown in Table [21 In all of our runs Algorithm 4 converged to MLE. 
The experiments show that Algorithm 4 converges faster when the dimension is larger 
than 7. The computational cost per an update of Algorithm 4 is expected to be larger 
than that of the conventional IPS. As we can see from Table [2], however, the number of 
steps to convergence of Algorithm 4 is smaller than that of the conventional IPS. v/vcom 
gets smaller as the dimension of the model gets larger. Therefore the results of the 
experiments suggest that Algorithm 4 is more efficient than the conventional IPS when 
the dimension of the model is large for general hierarchical models. 

5 Some discussions 

For using the proposed algorithms, we have to find a family of decomposable submodels 
that span a generating class of a hierarchical model. We recommend spanning the gener- 
ating class by a small number of large decomposable submodels. Here large decomposable 
submodels might mean maximal submodels in the sense of model inclusion or submodels 
with largest degrees of freedom. In the literature some methods for finding a maximal 
chordal subgraph of a given graph are studied ([2], [T|, [E]). In the case of graphical 
models, this might give a solution to our problem. However we have to satisfy the con- 
dition that each element of a generating class is contained in at least one decomposable 
submodel. Therefore we need a method to find a maximal chordal subgraph under the 
restriction that specific generators are contained. 

A referee suggested the following simple algorithm. Suppose that a model M{C) with 
|C| = m is given. For each set C in C 

choose an ordering (Ci, . . . Cm) of sets in C such that Ci = C and |Cj_i flCj | = max 

for all j > 2; 
set C = {Ci}] 
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Table 2: CPU time and the number of steps to convergence 
(i) / = 2 





CPU time 


Pr(r < Tconv) 


number of steps 


Uim 


^conv 


T 




'^conv 


V 


y / ^corw 


4 


0.0156 


0.0178 


0.171 




11.652 


6.887 


0.591 


5 


0.0193 


0.0205 


0.237 




9.391 


4.413 


0.470 


6 


0.0233 


0.0198 


0.465 




7.841 


3.348 


0.427 


7 


0.0289 


0.0204 


0.884 




8.000 


3.000 


0.375 


o 
5 


0.0407 


0.0258 


0.957 




9.000 


3.000 


0.333 








(ii) / = 


3 










CPU time 


Pr(r < Tconv) 


number of steps 


Uim 




r 




'^conv 


V 


y / ^com 


4 


0.0337 


0.0440 


0.023 




11.098 


6.428 


0.579 


5 


0.0455 


0.0463 


0.394 




9.345 


4.455 


0.477 


6 


0.0484 


0.0451 


0.469 




7.000 


3.000 


0.429 


7 


0.0697 


0.0559 


0.929 




8.000 


3.000 


0.375 


o 

a 


0.0951 


0.0672 


0.997 




9.000 


3.000 


0.333 








(iii) / = 


4 










CPU time 


Pr(T < Tconv) 


number of steps 


Dim 


^conv 


T 




^conv 


V 


y / ^coav 


4 


0.0665 


0.0941 


0.000 




10.493 


4.943 


0.471 


5 


0.0722 


0.1032 


0.041 




7.080 


2.980 


0.421 


6 


0.1005 


0.1007 


0.324 




7.000 


3.000 


0.429 


7 


0.1437 


0.1254 


0.971 




8.000 


3.000 


0.375 


8 


0.2028 


0.1551 


0.997 




9.000 


3.000 


0.333 



for j = 2, . . . , m 

if C U {Cj} has the running intersection property then set C := C U {Cj}. 
Note that testing the running intersection property on a set family takes linear time [18] . 

In this paper we compared various algorithms of IPS in terms of the CPU time to 
convergence. We showed that proposed algorithm converges faster than conventional IPS 
when the model is large by numerical experiments. We consider the implementation of 
the tree-computation of Algorithm 1 only in the case of cycle models. It may be possible 
to implement the tree-computation of Algorithm 1 for general hierarchical model when 
the decomposable submodels are given. This topic needs further investigation and is left 
to our future research. 
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